## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)

Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Which chemical properties influence the quality of red wines?
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
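The second summary above no longer includes `X`, which is only a row index (1 to 1599). A minimal sketch of dropping such a column by name follows; `wines_demo` is a small made-up data frame standing in for `wines`, and its values are illustrative rather than taken from the dataset.

```r
# Toy stand-in for the wine data; values are made up for illustration.
wines_demo <- data.frame(
  X       = 1:4,
  alcohol = c(9.4, 9.8, 10.5, 11.2),
  quality = c(5, 5, 6, 7)
)

# Drop the row-index column by name; drop = FALSE keeps a data frame
# even if only one column were to remain.
wines_demo <- wines_demo[, names(wines_demo) != "X", drop = FALSE]

names(wines_demo)
```

Selecting by name rather than by position keeps the step safe if the column order ever changes.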
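# Use a for loop to print a histogram of every variable
plots_uni <- list()
for (nm in names(wines)) {
  # .data[[!!nm]] replaces the deprecated aes_string(); !! fixes the
  # column name now, so stored plots stay correct after the loop ends
  plots_uni[[nm]] <- ggplot(wines, aes(x = .data[[!!nm]])) +
    # NOTE: one fixed binwidth is too coarse for density (range ~0.014)
    # and too fine for total.sulfur.dioxide (range ~283)
    geom_histogram(binwidth = 0.1) +
    xlab(nm)
  print(plots_uni[[nm]])
}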
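The variables span very different scales (density covers about 0.014 units, total sulfur dioxide about 283), so no single fixed bin width suits every histogram. One possible approach, sketched here with a hypothetical helper `range_binwidth()` that is not part of the original analysis, is to derive the bin width from each variable's observed range.

```r
# Hypothetical helper: choose a bin width that splits the observed range
# of x into `bins` equal-width bins.
range_binwidth <- function(x, bins = 30) {
  diff(range(x, na.rm = TRUE)) / bins
}

# Minimum and maximum values taken from the summary output above.
range_binwidth(c(0.9901, 1.0037))  # density
range_binwidth(c(6, 289))          # total.sulfur.dioxide
```

Inside the loop, the result could be passed as `binwidth = range_binwidth(wines[[nm]])`. For the integer-valued `quality`, a bin width of 1 is the more natural choice.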
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
There are 1,599 red wines in the dataset with 11 attributes:
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Quality is the main feature because we are investigating which chemical properties influence the quality of red wines.
The distribution of quality is roughly bell-shaped: most wines are rated 5 or 6, fewer are rated 4 or 7, and very few are rated 3 or 8.
## wines$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4400 0.6475 0.8450 0.8845 1.0100 1.5800
## --------------------------------------------------------
## wines$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.230 0.530 0.670 0.694 0.870 1.130
## --------------------------------------------------------
## wines$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.460 0.580 0.577 0.670 1.330
## --------------------------------------------------------
## wines$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.3800 0.4900 0.4975 0.6000 1.0400
## --------------------------------------------------------
## wines$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
## --------------------------------------------------------
## wines$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
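The per-quality summaries above have the layout that base R's `by()` prints. The output does not name the summarised variable; its overall minimum (0.12) and maximum (1.58) match `volatile.acidity`, so that is assumed here. A minimal sketch on a made-up stand-in data frame (`wines_demo`, illustrative values only):

```r
# Toy stand-in for the wine data; values are made up for illustration.
wines_demo <- data.frame(
  volatile.acidity = c(0.88, 0.76, 0.58, 0.60, 0.40, 0.36),
  quality          = c(3, 3, 5, 5, 7, 7)
)

# Summarise volatile acidity separately within each quality score.
va_by_quality <- by(wines_demo$volatile.acidity, wines_demo$quality, summary)
va_by_quality
```

Each element of the result is an ordinary `summary()` object, so individual statistics such as the median can be extracted for comparison across quality scores.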
# Plot a boxplot of each variable against quality
# using a for loop
plots_bi <- list()
for (nm in setdiff(names(wines), "quality")) {
  # key each plot by its own variable name; !! fixes the column name now
  plots_bi[[nm]] <- ggplot(wines, aes(x = quality, y = .data[[!!nm]], group = quality)) +
    geom_boxplot() +
    ylab(nm)
  print(plots_bi[[nm]])
}
# fixed acidity does not appear to correlate with quality
# volatile acidity appears to negatively correlate with quality
# citric acid appears to positively correlate with quality
# residual sugar needs further investigation because of lots of
# outliers stretching the chart
# chlorides needs further investigation because of lots of
# outliers stretching the chart
# free sulfur dioxide does not appear to correlate with quality
# total sulfur dioxide does not appear to correlate with quality
# density appears to be negatively correlated with quality
# pH appears to be negatively correlated with quality
# sulphates appears to be positively correlated with quality
# alcohol appears to be positively correlated with quality
##
## Pearson's product-moment correlation
##
## data: wines$volatile.acidity and wines$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
##
## Pearson's product-moment correlation
##
## data: wines$citric.acid and wines$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
##
## Pearson's product-moment correlation
##
## data: wines$sulphates and wines$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
##
## Pearson's product-moment correlation
##
## data: wines$alcohol and wines$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
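The four blocks above are the printed results of Pearson correlation tests, the default method of base R's `cor.test()`. The sketch below shows how such a test can be run and how its pieces can be read programmatically. The vectors are made-up illustrative values, not the wine data, and the Spearman variant is an addition not used in the original analysis; it may be worth considering because quality is an ordinal score.

```r
# Toy stand-in vectors; values are made up for illustration.
alcohol <- c(9.4, 9.8, 10.0, 10.5, 11.2, 12.0, 12.8, 13.1)
quality <- c(5, 5, 4, 6, 6, 7, 7, 8)

# Pearson correlation test (the default method of cor.test()).
pearson <- cor.test(alcohol, quality)
pearson$estimate   # the "cor" value in the printed output
pearson$conf.int   # the 95 percent confidence interval

# Rank-based alternative for an ordinal outcome.
# exact = FALSE avoids the warning about ties in the ranks.
spearman <- cor.test(alcohol, quality, method = "spearman", exact = FALSE)
spearman$estimate
```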
# Calculate all pairwise correlations at once instead of one at a time,
# using ggcorr() from the GGally package
ggcorr(wines, label=TRUE, hjust = 0.75, size = 3, color = "grey50", label_size = 3, layout.exp = 1, label_round = 2, label_alpha = TRUE)
# should have done this at the beginning of my analysis
Volatile Acidity and Quality (-0.39)
Citric Acid and Quality (+0.23)
Total Sulfur Dioxide and Quality (-0.19)
Sulphates and Quality (+0.25)
Alcohol and Quality (+0.48)

Volatile Acidity and Citric Acid (-0.55)
Volatile Acidity and Sulphates (-0.26)

Fixed Acidity and Density (+0.67)
Fixed Acidity and pH (-0.68)
Citric Acid and pH (-0.54)
Free Sulfur Dioxide and Total Sulfur Dioxide (+0.67)
Density and Alcohol (-0.50)
The strongest relationships with quality were:
Alcohol (+0.476)
Volatile Acidity (-0.390)
Sulphates (+0.251)
Citric Acid (+0.226)
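A ranking like this can also be produced programmatically instead of being read off the correlation plot. The sketch below uses a simulated stand-in data frame (`wines_demo`, made-up values) and ranks variables by the absolute size of their correlation with quality. The final step builds a subset of the top correlates plus quality; the plotting loop that follows uses a data frame called `wines_sub` whose definition is not shown, so this is only one plausible way such a subset could be built, not the author's confirmed method.

```r
# Simulated stand-in for the wine data; values are made up for illustration.
set.seed(1)
n <- 200
wines_demo <- data.frame(alcohol = rnorm(n), sulphates = rnorm(n), pH = rnorm(n))
wines_demo$quality <- 0.8 * wines_demo$alcohol +
  0.3 * wines_demo$sulphates + rnorm(n)

# Correlation of every variable with quality, excluding quality itself.
cors <- cor(wines_demo)[, "quality"]
cors <- cors[names(cors) != "quality"]

# Rank by absolute strength while keeping the sign.
ranked <- cors[order(abs(cors), decreasing = TRUE)]
round(ranked, 3)

# Hypothetical subset: the two strongest correlates plus quality (last column).
top_vars <- names(ranked)[1:2]
wines_sub_demo <- wines_demo[, c(top_vars, "quality")]
```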
# Scatterplot of every pair of variables in wines_sub, colored by quality
plots_mu <- list()
vars_mu <- setdiff(names(wines_sub), "quality")
for (xn in vars_mu) {
  for (yn in vars_mu) {
    # key each plot by its variable pair so no plot overwrites another
    key <- paste(xn, yn, sep = "_")
    plots_mu[[key]] <- ggplot(aes(x = .data[[!!xn]], y = .data[[!!yn]],
                                  color = factor(quality)), data = wines_sub) +
      geom_point(alpha = 0.5, size = 1, position = 'jitter') +
      scale_color_brewer(type = 'div', palette = 'Spectral',
        guide = guide_legend(title = 'Quality', reverse = TRUE,
          override.aes = list(alpha = 1, size = 2))) +
      xlab(xn) +
      ylab(yn)
    print(plots_mu[[key]])
  }
}
The combination of high sulphates and high alcohol is associated with the highest quality scores.
Both citric acid and volatile acidity were correlated with the quality of red wine: volatile acidity negatively (-0.39) and citric acid positively (+0.23). Citric acid is also negatively correlated with volatile acidity and positively correlated with fixed acidity.
Residual sugar was not correlated with quality scores.
ggplot(aes(x = quality), data = wines) +
geom_histogram(binwidth = 1) +
xlab("Quality") +
ylab("Number of Red Wines") +
ggtitle('Red Wine Quality counts')
The distribution of red wine quality is roughly bell-shaped. Most of the red wines in the dataset were given a quality score of 5 or 6. Very few red wines scored 3 or 8, and no wine scored below 3 or above 8.
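The statements about how the scores are distributed can be backed with a count table alongside the histogram. A minimal sketch, using a short made-up vector `quality_demo` in place of `wines$quality`:

```r
# Toy stand-in for wines$quality; values are made up for illustration.
quality_demo <- c(3, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8)

# Count of wines per score, and each score's share of the total.
counts <- table(quality_demo)
shares <- prop.table(counts)
round(shares, 2)

# Combined share of scores 5 and 6, and the lowest and highest score.
share_5_6 <- sum(shares[c("5", "6")])
score_range <- range(quality_demo)
```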
ggplot (aes(x = quality, y = alcohol , group=quality), data = wines) +
geom_boxplot(color = 'blue') +
geom_point(alpha = 0.1,
position = position_jitter(height = 0),
color = 'grey50') +
xlab("Quality") +
ylab("Alcohol Percentage (%)") +
ggtitle('Alcohol Percentage (%) by Quality')
Red wines with a quality score of 3, 4, or 5 had low alcohol percentages, with medians around 10%. Above that, the median alcohol percentage increased with quality.
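The medians read off the boxplots can also be computed directly. A minimal sketch with base R's `aggregate()`, on a made-up stand-in data frame (`wines_demo`, illustrative values only):

```r
# Toy stand-in for the wine data; values are made up for illustration.
wines_demo <- data.frame(
  alcohol = c(9.8, 10.0, 9.9, 10.1, 10.4, 10.8, 11.3, 11.7, 12.0, 12.4),
  quality = c(3, 3, 4, 5, 6, 6, 7, 7, 8, 8)
)

# Median alcohol percentage within each quality score.
medians <- aggregate(alcohol ~ quality, data = wines_demo, FUN = median)
medians
```

The result has one row per quality score, which makes it straightforward to check at which score the medians start to rise.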
ggplot(aes(x = alcohol, y = sulphates, color=factor(quality)), data = wines) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div', palette = 'Spectral',
guide = guide_legend(title = 'Quality', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
ggtitle('Quality by Alcohol and Sulphates') +
xlab("Alcohol Percentage (%)") +
ylab("potassium sulphate (g/dm3)") +
scale_x_continuous(limits = c(9, 14)) +
scale_y_continuous(limits = c(0.3, 1.1)) +
geom_smooth(method='lm', se=FALSE, size=1)
Red wines with a quality score of 3 typically had low sulphates and low alcohol percentages. Higher levels of sulphates and alcohol were associated with higher quality scores.
The red wine data set contains information on 1,599 red wines. I started by exploring the individual variables in the data set using plots. Using several R libraries, I was able to determine which variables had the strongest relationship with quality and which variables were correlated with each other.
I learned that citric acid is correlated with the quality of red wine. Citric acid is also negatively correlated with volatile acidity and positively correlated with fixed acidity. Any model would need to take these correlations between predictors into account.
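One common way to check how much correlated predictors overlap before fitting a model is the variance inflation factor (VIF). This diagnostic is not part of the original analysis. The sketch below computes it by hand in base R on simulated data; `vif_manual()` is a hypothetical helper, and the simulated columns only borrow the wine variable names for readability.

```r
# Simulated data; citric.acid is deliberately built to track fixed.acidity.
set.seed(42)
n <- 300
fixed.acidity <- rnorm(n, mean = 8.3, sd = 1.7)
citric.acid   <- 0.08 * fixed.acidity + rnorm(n, sd = 0.12)
alcohol       <- rnorm(n, mean = 10.4, sd = 1)
quality       <- 0.4 * alcohol + 0.5 * citric.acid + rnorm(n, sd = 0.7)
demo <- data.frame(fixed.acidity, citric.acid, alcohol, quality)

# Hypothetical helper: VIF of a predictor is 1 / (1 - R^2), where R^2 comes
# from regressing that predictor on all the other predictors.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

predictors <- c("fixed.acidity", "citric.acid", "alcohol")
vifs <- vif_manual(demo, predictors)
round(vifs, 2)
```

A VIF near 1 means a predictor is nearly independent of the others. Larger values indicate that its coefficient estimate would be less stable, in which case dropping or combining predictors may be worth considering.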
I struggled to figure out which type of graph was best suited to this investigation, which led to creating more graphs than necessary, re-creating graphs, and deleting graphs. I suspect that with more experience I could spend more time planning which graphs I need to draw conclusions, instead of simply starting with graphs.